Gathering Sensor Data
- Status: accepted
- Deciders: Artur Wojnar, Andres Lamont
- Date: 2022-08-01
Technical Story: https://orikami.atlassian.net/browse/DBP-455
Context and Problem Statement
In the linked issue, a patient can generate up to 2 GB of data in a daily measurement that is taken once per month. Although this is a requirement for a single study, we expect bigger loads in the future, so it is worth designing an architecture suitable for gathering data from IoT devices.
Decision Drivers
- Scalability: we should be able to spread the load among all of our gateway's instances
- A known format/structure for the sensor data sent to the API, so we don't have to figure out anything on our own
- Use of ready-to-use components and standards
- Do not persist sensor data in Apache Pulsar
- Be able to handle a constant load of sensor data
- Be able to retrieve sensor data from a specific period (e.g. half a year) for analysis purposes
- Do not overburden the Gateway with constant requests
- Offload older data out of the database
- Cut down on the data sent from the client
Considered Options
- Separate time-series collection in MongoDB (version >= 5.2). The granularity should be "seconds", because the mobile client stores the data in batches and sends them every 2 seconds (Reference). Configure the Atlas MongoDB Online Archive (Reference). Attention: Online Archive for time series collections is available as a Preview; the feature and corresponding documentation may change at any time in the Preview stage. We have to configure the Archive rules through our API and the Atlas API, because we store data per tenant and therefore have multiple time series collections (API Reference).
- Communication protocol
- WebSockets to handle the sensor data load. Since Pulsar supports WS, we could proxy the data through the Gateway, which would mean two open sockets per patient: one for the client-gateway connection and one for the gateway-Pulsar connection. Pulsar offers a WS protocol, but it is pointless in our case; as the docs say: "Pulsar WebSocket API provides a simple way to interact with Pulsar using languages that do not have an official client library." (Node.js has a native pulsar-client.) (Reference.) The Gateway has to do JWT authentication: the JWT token is passed through the WS endpoint and validated when the client initializes the connection on the Gateway. With Pulsar's native WS connection we could forward the stream directly from the client to Pulsar, but that requires creating a separate Pulsar connection.
- HTTP/2 | gRPC (client streaming configuration, Reference) + Protobuf. gRPC runs over HTTP/2, which speeds up the data flow (Reference). gRPC streaming allows a limited number of open HTTP/2 streams; the default is 250, but it can be increased up to 5000 (Traefik link). Above the limit, subsequent RPCs are queued (link). Remember to use keep-alive pings. We have to forward the messages from the stream to Pulsar, but we can simply send the binary data we receive, wrapped in an envelope with some metadata, the same way as for the WS option.
- REST API. Listed for documentation purposes only; not seriously considered.
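The "binary data wrapped in an envelope with some metadata" idea from the gRPC option can be sketched as below. The field names (`patientId`, `sentAt`, `payload`) are illustrative assumptions; in practice the envelope would be a Protobuf message compiled from the agreed .proto contract.

```typescript
// Hypothetical envelope for one sensor-data batch. In the real system this
// would be a Protobuf message; here it is a plain TypeScript shape to show
// the intent: opaque binary payload plus routing/metadata fields.
interface SensorEnvelope {
  patientId: string;   // who produced the batch (illustrative name)
  sentAt: number;      // client-side timestamp, epoch milliseconds
  payload: Uint8Array; // opaque sensor batch, forwarded to Pulsar as-is
}

// Wrap a raw batch received on the stream without inspecting its contents.
function wrap(patientId: string, payload: Uint8Array): SensorEnvelope {
  return { patientId, sentAt: Date.now(), payload };
}
```

The Gateway never needs to decode `payload`; it only attaches metadata and forwards, which keeps the hot path cheap.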
Decision Outcome
Chosen option: options 1 and 2b, because we don't have better alternatives. Option 2b because HTTP/2 is still more HTTP than WS, we can send binary data in a contracted way with Protobuf, and we have a protocol-compliant way to deal with OAuth2. Moreover, in the future we can use gRPC for microservice communication, and we can use multiplexing if we need to send more streams from within one client. The Pulsar communication stays untouched; we have an already-created producer that will be used to send the stream. Later on, if further optimizations are needed, we can use a MongoDB sink combined with Pulsar.
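Forwarding the incoming client stream to the already-created Pulsar producer could look roughly like this. The `Producer` interface is a minimal stand-in mirroring the native pulsar-client's `producer.send({ data })` shape; treat the names here as assumptions, not the library's actual types.

```typescript
// Minimal stand-in for a Pulsar producer (the native Node pulsar-client
// exposes a similar send({ data }) method taking a binary payload).
interface Producer {
  send(msg: { data: Uint8Array }): Promise<void>;
}

// Consume the client stream chunk by chunk and hand each chunk to Pulsar.
// Awaiting each send applies backpressure to the incoming stream.
// Returns the number of forwarded messages.
async function forwardStream(
  chunks: AsyncIterable<Uint8Array>,
  producer: Producer,
): Promise<number> {
  let forwarded = 0;
  for await (const chunk of chunks) {
    await producer.send({ data: chunk });
    forwarded += 1;
  }
  return forwarded;
}
```

Because gRPC client streams are async iterables in most Node gRPC wrappers, this loop maps naturally onto the chosen option.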
Positive Consequences
- Significantly improved performance while dealing with streaming data from IoT devices.
Negative Consequences
- A small amount of additional implementation work and enhancements, plus the learning curve of gRPC streaming.
Pros and Cons of the Options
[option 1]
It’s the only reliable option for this purpose. We have to make sure we run MongoDB version >= 5.2 to get the optimizations provided for the Time Series feature, such as compression, which we really need to cut down on size because our data points will be similar to each other, with values differing only by bits.
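A per-tenant time-series collection could be configured roughly as follows. The collection naming scheme and field names (`timestamp`, `sensor`) are assumptions for illustration; the option shape matches MongoDB's `createCollection` time-series options.

```typescript
// Sketch: per-tenant time-series collection options for MongoDB >= 5.2.
interface TimeSeriesCreateOptions {
  timeseries: {
    timeField: string;
    metaField: string;
    granularity: "seconds" | "minutes" | "hours";
  };
}

// Options passed to db.createCollection() for one tenant's collection.
// "seconds" granularity matches the mobile client's 2-second batch interval.
function timeSeriesOptions(): TimeSeriesCreateOptions {
  return {
    timeseries: {
      timeField: "timestamp", // when the batch was recorded (assumed name)
      metaField: "sensor",    // device/sensor identifier, used for bucketing
      granularity: "seconds",
    },
  };
}

// Hypothetical per-tenant naming scheme.
function collectionName(tenantId: string): string {
  return `sensor_data_${tenantId}`;
}

// Usage with the official mongodb driver (not executed here):
// await db.createCollection(collectionName("tenant-a"), timeSeriesOptions());
```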
We may not need old data, so using the built-in Atlas feature is a perfect match for us. The only con is the need for a manual or automated way to configure the Archive rules; each rule is linked to one database and one collection. There can be long-running studies that hold data for a long time. We don’t know much about that case; maybe we will consider a data lake in the future, but for now the Archive rules are enough. Another con is that this feature is still a Preview, but it’s reasonable to expect it to become stable soon.
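Since collections are per tenant, our API would build one Online Archive rule per tenant and submit it through the Atlas Admin API. The payload shape below (`dbName`/`collName`/`criteria`) follows the Atlas Online Archive documentation as we understand it, but it should be treated as an assumption and verified against the current API reference; the naming scheme is illustrative.

```typescript
// Sketch: per-tenant Online Archive rule payload (shape is an assumption;
// verify against the Atlas Admin API reference before relying on it).
interface ArchiveRule {
  dbName: string;
  collName: string;
  criteria: {
    type: "DATE";          // archive based on a date field
    dateField: string;     // field compared against the age threshold
    expireAfterDays: number;
  };
}

function archiveRuleForTenant(
  tenantId: string,
  expireAfterDays: number,
): ArchiveRule {
  return {
    dbName: `tenant_${tenantId}`, // hypothetical per-tenant database naming
    collName: "sensor_data",      // hypothetical collection name
    criteria: { type: "DATE", dateField: "timestamp", expireAfterDays },
  };
}
```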
[option 2a]
Pros:
- Traefik supports WS out of the box (link)
- The solution is stateless
- OAuth can be challenged on connection
- Theoretically, a bigger number of concurrent open connections
Cons:
- No built-in binary protocol like Protobuf
- It's not HTTP
- No built-in OAuth support (requires a tweak)
- Learning curve
- Problems with scalability after initiating a connection
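The "JWT passed through the WS endpoint and validated on connection initialization" step can be sketched as below. Carrying the token in a query parameter (named `token` here) is an assumption, as is the `isWellFormedJwt` stub; a real Gateway would verify the signature and claims with a JWT library against the OAuth2 issuer's keys.

```typescript
// Sketch: pulling a JWT from the WebSocket upgrade request so the Gateway
// can validate it before accepting the connection. Browsers cannot set an
// Authorization header on a WS handshake, so the token is commonly carried
// as a query parameter or subprotocol; the parameter name is an assumption.
function extractToken(requestUrl: string): string | null {
  const url = new URL(requestUrl, "http://placeholder"); // base for relative URLs
  return url.searchParams.get("token");
}

// Placeholder structural check only: a compact JWT has three dot-separated
// parts. Real validation (signature, expiry, audience) needs a JWT library.
function isWellFormedJwt(token: string): boolean {
  return token.split(".").length === 3;
}
```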
[option 2b]
Pros:
- Traefik supports gRPC (HTTP/2) out of the box (link)
- The solution is stateless
- It's still HTTP/2, no new protocol
- Predefined contract for the binary data structure
- Multiplexing
- gRPC can be used later for internal microservice communication (if needed)
- Built-in OAuth2 support
- Suitable for streams and real-time data, with (likely) better performance than WS
Cons:
- Learning curve
[option 2c]
Pros:
- No need for changes in our source code
- Easy
Cons:
- Heavyweight
- Increased latency
- Significantly increased traffic (requests every second)
- Problems with scalability after initiating a connection